Introduction

Author: Alessandro Arnone

Akadelivers is a home-delivery company specializing in delivering packages in under 1 hour, a model known as Q-commerce (quick commerce). The company has a mobile app through which users can choose from a catalogue of products from local stores in their city and have them delivered to the address of their choice in under 10 minutes.

When a user places an order through Akadelivers, they are charged the total cost directly (product cost + service fee + delivery fee). Once the user has paid, the courier closest to the store that stocks the product goes there, pays for the product, picks it up and delivers it to the address the user has chosen.

Data Dictionary

order_id: Identification number of the order.

local_time: Local time at which the order is placed.

country_code: Code of the country in which the order is placed.

store_address: Number of the store where the order is placed.

payment_status: Status of the order.

n_of_products: Number of products purchased in the order.

products_total: Amount in euros that the user has purchased in the app.

final_status: Final status of the order (this is the 'target' variable to predict), which indicates whether the order will ultimately be delivered or cancelled. There are two types of status: delivered and cancelled.

Objective

Summary & Results

In this section I want to give an immediate overview of the results and the methodology used.

Questions

  1. TOP 3 COUNTRIES: AR, ES, TR
  2. TOP 3 HOURS: 20:00-21:00, 21:00-22:00, 19:00-20:00
  3. AVERAGE TRANSACTION STORE 12513: 17.38
  4. PARTITION:
    • SHIFT 1: 00:00 - 07:59 - <1%
    • SHIFT 2: 08:00 - 15:59 - 43-44%
    • SHIFT 3: 16:00 - 24:00 - 66%

Modelling

Which are the 3 countries where the most orders are placed?

Development

Before counting the orders, we need to verify that the 54330 observations contain no duplicates.

Once this is verified, we can count the order_id values grouped by country.
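The two steps above can be sketched as follows. This is a minimal illustration on a hypothetical six-row table (the values are invented, only the column names order_id and country_code come from the data dictionary), not the original notebook code.

```python
import pandas as pd

# Hypothetical miniature of the orders table; the real dataset has 54330 rows.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5, 6],
    "country_code": ["AR", "AR", "AR", "ES", "ES", "TR"],
})

# Step 1: count order_id values that appear more than once.
n_duplicates = int(orders["order_id"].duplicated().sum())
print("duplicated order_id values:", n_duplicates)

# Step 2: count orders per country and keep the three largest.
top_countries = (
    orders.groupby("country_code")["order_id"]
    .count()
    .sort_values(ascending=False)
    .head(3)
)
print(top_countries)
```

If duplicates were present, they would have to be dropped (for example with drop_duplicates on order_id) before counting.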

Answer

The top 3 countries by number of orders are Argentina, Spain and Turkey.

At which hours are the most orders placed in Spain?

Development

Answer

The busiest hours fall around dinner time, between 19:00 and 21:59, with the time span from 20:00:00 to 20:59:59 being the busiest.
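One way to obtain such an hourly ranking is sketched below. It assumes that local_time is stored as an HH:MM:SS string and that Spain is coded as "ES"; the rows are invented for illustration.

```python
import pandas as pd

# Hypothetical sample; local_time is assumed to be an HH:MM:SS string.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5, 6, 7],
    "country_code": ["ES", "ES", "ES", "ES", "ES", "ES", "AR"],
    "local_time": ["20:15:00", "20:40:10", "20:59:59", "21:05:00",
                   "21:30:00", "19:45:00", "20:00:00"],
})

# Keep Spanish orders only and extract the hour of day.
spain = orders[orders["country_code"] == "ES"].copy()
spain["hour"] = pd.to_datetime(spain["local_time"], format="%H:%M:%S").dt.hour

# value_counts sorts by frequency, so the head holds the busiest hours.
orders_per_hour = spain["hour"].value_counts().head(3)
print(orders_per_hour)
```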

What is the average price per order at the store with ID 12513?

Development
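A possible way to compute this figure is sketched below. It assumes that store_address holds the store ID and that "average price per order" means the mean of products_total; the sample values are invented and do not reproduce the real result.

```python
import pandas as pd

# Hypothetical sample; store 99999 is an invented second store.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "store_address": [12513, 12513, 99999, 12513],
    "products_total": [10.0, 20.0, 50.0, 30.0],
})

# Filter the orders of the store of interest and average their totals.
store_orders = orders[orders["store_address"] == 12513]
average_order_value = store_orders["products_total"].mean()
print(round(average_order_value, 2))
```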

Answer

What percentage of couriers would you assign to each shift so that they can cope with demand peaks?

Take into account the demand peaks in Spain and assume that couriers work in 8-hour shifts.

What percentage of couriers would you assign to each shift so that they can cope with demand peaks? (e.g. Shift 1: 30%, Shift 2: 10%, Shift 3: 60%).

Development
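A simple allocation rule, assumed here for illustration only, is to staff each shift in proportion to its share of orders. The sketch below uses the three 8-hour shifts listed in the summary (starting at 00:00, 08:00 and 16:00) and an invented list of order hours.

```python
import pandas as pd

# Hypothetical order hours (0-23) for Spanish orders.
spain_hours = pd.Series([2, 9, 10, 13, 14, 17, 19, 20, 20, 21], name="hour")

# Integer division by 8 maps hours 0-7 to shift 1, 8-15 to shift 2, 16-23 to shift 3.
shift = spain_hours // 8 + 1

# Share of orders per shift, in percent.
shift_share = shift.value_counts(normalize=True).sort_index() * 100
print(shift_share.round(1))
```

Proportional staffing ignores minimum coverage requirements, so a very small share for the night shift might still need to be rounded up in practice.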

Answer

Machine learning predictive model built from the 'train.csv' dataset

Introduction

Data Wrangling

Exploratory Data Analysis

Status payment vs. final status

When the payment status of a transaction is NOT_PAID, the probability of the order being cancelled appears to increase. Hence payment_status will be included in our model.

To assess this, a chi-square test of independence will be performed (categorical vs. categorical). If the null hypothesis H0 (independence) is rejected, the two variables will be considered dependent, hence part of the variability in the final status can be explained by the variability in payment_status.

We can reject the null hypothesis and conclude that there is a relationship between payment_status and final_status.
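The test can be sketched with SciPy as follows. The data are invented, the labels "PAID", "delivered" and "cancelled" are assumed placeholders (only NOT_PAID appears in the text), and the 0.05 significance level is an assumption.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data: 100 delivered orders followed by 50 cancelled orders.
orders = pd.DataFrame({
    "payment_status": ["PAID"] * 90 + ["NOT_PAID"] * 10
                      + ["PAID"] * 10 + ["NOT_PAID"] * 40,
    "final_status": ["delivered"] * 100 + ["cancelled"] * 50,
})

# Contingency table of observed counts for the two categorical variables.
contingency = pd.crosstab(orders["payment_status"], orders["final_status"])

# Chi-square test of independence; H0 states that the variables are independent.
chi2, p_value, dof, expected = chi2_contingency(contingency)

# Assumed significance level of 0.05.
reject_independence = p_value < 0.05
print(contingency)
print("chi2:", chi2, "p-value:", p_value, "reject H0:", reject_independence)
```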

Number of products vs final status

- Outliers check:

- Median centered around 2
    - 62% of the transactions have 1 or 2 products
    - 89% of the transactions have 5 products or fewer

- Distribution positively skewed

- Extreme values are expected given the right-skewed distribution, hence they will not be removed
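The summary statistics listed above (median, cumulative shares, skewness) can be computed along these lines. The ten values are invented and only meant to show the calculations, not to reproduce the reported percentages.

```python
import pandas as pd

# Hypothetical product counts with one large value in the right tail.
n_of_products = pd.Series([1, 1, 1, 2, 2, 2, 3, 4, 5, 12], name="n_of_products")

median = n_of_products.median()

# Share of orders with at most 2 and at most 5 products.
share_up_to_2 = (n_of_products <= 2).mean()
share_up_to_5 = (n_of_products <= 5).mean()

# A positive sample skewness indicates a longer right tail.
skewness = n_of_products.skew()

print(median, share_up_to_2, share_up_to_5, skewness)
```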

The percentage of delivered orders is constant up to 13 products (which account for 99%+ of the data), so the number of products does not look useful for our model, since it does not explain any variability in our target variable. Moreover, the point-biserial test, used to test the dependency between the number of products and our target variable, shows a correlation of approximately 0 with a reliable p-value.
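A point-biserial correlation between a binary target and a numeric feature can be computed with SciPy as sketched below. The data are simulated so that the two variables are independent by construction; the 0/1 encoding of the target and the variable names are assumptions.

```python
import numpy as np
from scipy.stats import pointbiserialr

rng = np.random.default_rng(0)

# Simulated binary target (1 = delivered, 0 = cancelled), assumed encoding.
delivered = rng.integers(0, 2, size=500)

# Simulated product counts, drawn independently of the target.
n_of_products = rng.poisson(2, size=500) + 1

# Point-biserial correlation: first argument binary, second numeric.
r, p_value = pointbiserialr(delivered, n_of_products)
print("r:", r, "p-value:", p_value)
```

A coefficient close to 0 means the feature carries little linear information about the target; the p-value tests the null hypothesis of zero correlation.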

Delivery time vs final status

We can see above that night orders (defined as orders placed from midnight until 6 am) follow a different trend from the rest: the probability that they are cancelled is much higher. Based on this, we will include the hour of delivery in our model.
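The comparison between night orders and the rest can be sketched as a grouped cancellation rate. The boundary (hours 0 to 5 counted as night), the status labels and the data are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical orders with the hour of day already extracted.
orders = pd.DataFrame({
    "hour": [1, 3, 5, 2, 12, 13, 20, 21, 19, 20],
    "final_status": ["cancelled", "cancelled", "delivered", "cancelled",
                     "delivered", "delivered", "delivered", "cancelled",
                     "delivered", "delivered"],
})

# Flag night orders: hours 0-5, an assumed reading of "midnight until 6 am".
orders["period"] = orders["hour"].lt(6).map({True: "night", False: "day"})

# Cancellation rate per period.
cancel_rate = (
    orders["final_status"].eq("cancelled")
    .groupby(orders["period"])
    .mean()
)
print(cancel_rate)
```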

Products total vs final status

The probability density plots of the 2 classes of our dependent variable almost overlap, except in the tail.

It seems that the larger the total amount of the transaction, the lower the probability that the order is completed. In fact, an order is much more likely to be delivered if the total amount is < 60 than if it is > 100. Unfortunately, transactions with a total amount > 100 make up only a small sample of our database.
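To check both the delivery rate and the sample size per amount range, the totals can be binned as sketched below. The bin edges follow the thresholds of 60 and 100 mentioned above; the data and status labels are invented.

```python
import pandas as pd

# Hypothetical order totals in euros.
orders = pd.DataFrame({
    "products_total": [5.0, 12.5, 30.0, 59.0, 75.0, 80.0, 120.0, 150.0],
    "final_status": ["delivered", "delivered", "delivered", "cancelled",
                     "delivered", "cancelled", "cancelled", "cancelled"],
})

# Left-closed bins: [0, 60), [60, 100), [100, inf).
bands = pd.cut(
    orders["products_total"],
    bins=[0, 60, 100, float("inf")],
    labels=["<60", "60-100", ">=100"],
    right=False,
)

# Number of orders and delivery rate per band.
summary = (
    orders.assign(band=bands, delivered=orders["final_status"].eq("delivered"))
    .groupby("band", observed=True)
    .agg(n_orders=("delivered", "size"), delivered_rate=("delivered", "mean"))
)
print(summary)
```

Reporting n_orders next to the rate makes it visible when a band is too small to support a conclusion.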

Country vs final status

The countries with the most transactions have different probabilities of an order being marked as delivered. Example:

Together they account for around 40% of the database, hence they can potentially help explain the variability of our dependent variable.

Store address vs final status

The higher the number of transactions per store, the higher the probability that the order is delivered. This trend holds for stores with more than 60 transactions.
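One way to examine this relationship, and to turn store volume into a candidate feature, is sketched below. The store IDs 101 and 202, the counts and the derived column names are hypothetical; whether the original model used such a feature is not stated.

```python
import pandas as pd

# Hypothetical data: store 101 has 80 orders, store 202 has 10.
orders = pd.DataFrame({
    "store_address": [101] * 80 + [202] * 10,
    "final_status": ["delivered"] * 76 + ["cancelled"] * 4
                    + ["delivered"] * 6 + ["cancelled"] * 4,
})

# Orders and delivery rate per store.
per_store = (
    orders.assign(delivered=orders["final_status"].eq("delivered"))
    .groupby("store_address")
    .agg(n_orders=("delivered", "size"), delivered_rate=("delivered", "mean"))
)

# Flag stores above the 60-transaction threshold mentioned in the text.
per_store["above_60"] = per_store["n_orders"] > 60

# Map the store volume back onto each order as a candidate feature.
orders["store_n_orders"] = orders["store_address"].map(per_store["n_orders"])
print(per_store)
```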

Modelling

Feature Selection

Train Test split

Modelling

Random forest

Logistic regression with SMOTE

Random Forest Classifier with SMOTE

Random Forest Classifier with SMOTE and RandomUnderSampler

AdaBoostClassifier

Gradient Boosting Classifier

Test set prediction

Summary